Introductory Data Science for Innovation (995N1) – Week 10, December 2021
Having covered getting data, processing it, working with text (text mining), and data visualisation, this lecture introduces machine learning. You may already have encountered some of the machine learning algorithms we will see today.
The lecture first gives you an overview of what machine learning is: a few definitions, what it is generally used for, and, in broad lines, the different types of machine learning you may encounter.
Second, we will delve into a few common machine learning algorithms and run some practical examples in R. We will also discuss ethical concerns that you should keep in mind when using or implementing such systems.
Machine learning gives computers the ability to acquire their own knowledge by extracting patterns from data, without the need for explicit rules or hard-coded knowledge (Goodfellow, Bengio, and Courville 2016). The computer learns from experience.
Machine learning uses computer algorithms to solve problems by building models, which in turn allows the computer to make decisions.
“Every algorithm has an input and an output: the data goes into the computer, the algorithm does what it will with it, and out comes the result. Machine learning turns this around: in goes the data and the desired result and out comes the algorithm that turns one into the other. Learning algorithms -also known as learners- are algorithms that make other algorithms.” (Domingos 2015)
Before machine learning, programmers had to write explicit rules so that the computer could follow a step-by-step guide on how to work with data.
Machine learning instead involves building models from data. The advance here is that the model's properties need not be known to the programmer in advance: the algorithm builds the model from the properties of the data itself.
How can the algorithm learn from data? There are many ways in which a model can be built from data. The most common types of machine learning techniques are supervised and unsupervised learning.
First, we will explore in more details what is meant by supervised and unsupervised learning.
Supervised learning: the machine builds a model from labelled examples, which enables the prediction of labels or values for other, unlabelled examples. This type of machine learning is mainly used to sort examples into different categories (classification), or to predict a target (a numeric value) from a set of features.
Unsupervised learning: learning is done without labelled examples; the machine learns from the known features of the data. The algorithm tries either to group together examples which are similar (i.e. clustering), to uncover differentiating features of the data, or to reduce the number of variables.
After the algorithm has sorted the data, some labelling may still be required, but this is done post-learning.
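To make the contrast concrete, here is a minimal sketch in R using the built-in iris data (not this module's dataset): the supervised model is given the Species labels to learn from, while k-means only ever sees the measurements.

```r
# Supervised: the Species labels are given, and a model learns to predict them
data(iris)
fit <- glm(I(Species == "setosa") ~ Sepal.Length + Sepal.Width,
           data = iris, family = binomial)

# Unsupervised: no labels are given; k-means groups rows by feature similarity
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)$cluster

# The clusters come back as bare numbers (1, 2, 3): any labelling of what
# each cluster "means" is done by the analyst, post-learning
table(clusters, iris$Species)
```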
In the next part of the lecture, we are going to explore in more details what supervised and unsupervised learning entails. While going through some general principles for these two types of machine learning, we will delve into examples used in innovation studies.
We will see some practical examples in R using text data.
Supervised learning is used when the analyst has a clear idea of the target to be reached. The target is given to the machine through the data, where one variable acts as the target.
The target can either be a classification problem, with two (binary) or more (multi-class) classes, or a regression problem, where the target is a number and the algorithm learns by looking at how far off its predictions are from the target.
The model learns from each of the examples given; to perform well, it needs to be given a variety of cases.
The data given to the algorithm is split into two sets: a training set, used to build the model, and a test set, used to evaluate how good the constructed model actually is.
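A minimal sketch of such a split in base R (the 80/20 ratio here is a common convention, not part of this lecture's code):

```r
set.seed(42)                                  # make the random split reproducible
n <- 1000                                     # suppose we have 1000 labelled examples
train_idx <- sample(n, size = 0.8 * n)        # 80% of rows go to the training set
test_idx  <- setdiff(seq_len(n), train_idx)   # the remaining 20% form the test set

length(train_idx)   # 800
length(test_idx)    # 200
```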
In supervised learning, there are five main steps to follow:
In this week’s exercise we will go more in depth into supervised learning. We will start from an example code of a supervised learning algorithm, which we will try to improve upon.
We will iteratively alter parameters, such as adjusting the training data or changing the algorithm, to see how the model's performance changes. Do open your workbook now, and we will start to work through it together.
There are several popular packages for machine learning in R: caret, mlr, and, more recently, tidymodels.
For our examples of supervised learning, we are going to use the latest version of mlr (i.e. mlr3), as it is simple to use and offers a large range of models.
To exemplify the process of supervised machine learning, we are going to use some data from the Cancerscreen project. The problem consists of checking whether a given paper is about diagnostics or not.
To label the data, we used the MeSH term classification, identifying terms which are specific to diagnostics.
Ultimately, the model would be used on data retrieved from bibliometric databases such as the Web of Science and Scopus, which do not come with such a classification.
# 1. Explore data
library(tidyverse)

Medline <- readRDS("Diag_full.rds")
Medline <- Medline %>%
  select(-diag_type, -case) %>%
  unique()
summary(Medline)
##       PMID              T_A            diagnosis
##  Min.   :22143403   Length:26751       N:11636
##  1st Qu.:25630780   Class :character   Y:15115
##  Median :25937533   Mode  :character
##  Mean   :25928773
##  3rd Qu.:26268906
##  Max.   :28590602
We have three columns:
PMID: the PubMed ID for each document
T_A: text, a combination of the title and abstract
diagnosis: whether the paper is about diagnosis
Let’s have a look at the data first:
Medline <- as_tibble(Medline)
glimpse(Medline)
## Rows: 26,751
## Columns: 3
## $ PMID      <int> 22143403, 22246415, 22467405, 22548776, 23188650, 23188879, …
## $ T_A       <chr> "Bayesian analysis on meta-analysis of case-control studies …
## $ diagnosis <fct> N, N, Y, N, N, N, N, N, Y, N, Y, Y, Y, Y, Y, N, N, Y, N, Y, …
For supervised machine learning we need examples, a target and features. The dataset is organised into observations (articles, identified by their PMID);
we have a target, the variable diagnosis;
but how do we transform the text into features?
We need first to normalise the text data…
library(tidytext)
library(tm)
library(SnowballC)

Stop_w <- c(stopwords("en"), "introduction", "conclusion", "objective", "aim",
            "methods", "results", "conclusions", "background", "percent", "may",
            "use", "used", "however", "p", "cancer", "study", "lung", "prostate",
            "prostatic", "patient", "colorectal")

Medline$T_A <- removeNumbers(Medline$T_A)
Medline$T_A <- tolower(Medline$T_A)
Medline$T_A <- str_replace_all(Medline$T_A, "[^[:alnum:]]", " ")
Medline$T_A <- removeWords(Medline$T_A, Stop_w)
Medline$T_A <- gsub('\\b\\w{1,2}\\b', '', Medline$T_A) # remove words of two letters or under
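To see what these steps do, here is a base-R approximation of the same pipeline applied to a toy string (the abstract text is made up, and base-R gsub stands in for the tm helpers):

```r
txt <- "OBJECTIVE: We screened 120 patients for early diagnosis of cancer."

txt <- gsub("[0-9]+", "", txt)          # remove numbers
txt <- tolower(txt)                     # lower-case everything
txt <- gsub("[^[:alnum:]]", " ", txt)   # replace punctuation with spaces

stop_w <- c("we", "for", "of", "objective", "cancer")  # tiny stop-word list
words  <- strsplit(txt, "\\s+")[[1]]
words  <- words[!(words %in% stop_w)]   # remove stop words
words  <- words[nchar(words) > 2]       # drop words of two letters or under
words                                   # "screened" "patients" "early" "diagnosis"
```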
Then we are transforming the data into a vector of features using tf-idf:
# reduce the size of the data to speed up the routine
Tidy_Med <- Medline[1:3000, ]
# extract tokens and use stemming
Tidy_Med <- Tidy_Med %>%
  unnest_tokens(word, T_A) %>%
  mutate(word = wordStem(word, language = "english"))
# compute word counts
Tidy_Med <- Tidy_Med %>%
  count(PMID, word, sort = TRUE)
# compute tf-idf
Tidy_Med <- Tidy_Med %>%
  bind_tf_idf(word, PMID, n)
Then we group the words together; the words will become the features:
# 4. Get the words as features in columns
# save the diagnosis labels separately
diag <- Medline %>%
  select(PMID, diagnosis) %>%
  unique()
# create a list of words to keep, which appear more often than a certain threshold
word <- Tidy_Med %>%
  group_by(word) %>%
  summarize(n_doc = n()) %>%
  filter(n_doc > 39)
In order to reduce the number of features, we only select words which are present in at least 40 documents.
We also saved the target (diagnosis) in a separate table so as not to lose it.
Finally, the word features will become columns (to be considered as variables):
# only select the words over the threshold
Tidy_Med <- inner_join(Tidy_Med, word)%>%
select(-n_doc)
#make words into columns/features
Tidy_Med <- Tidy_Med %>%
select(-c(n, tf, idf))%>%
pivot_wider(names_from = word,
values_from = tf_idf,
values_fill=list(tf_idf = 0))
# add back the diagnosis column
Tidy_Med <- inner_join(diag, Tidy_Med)
We can now see that each word is one column representing a feature, with the tf-idf score as its value. We will use these features to train the model…
head(Tidy_Med, n=6)
## # A tibble: 6 × 3,979
##       PMID diagnosis   data faecal   akt  imag   cell   pet  dose degre   alk
##      <int> <fct>      <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 25315432 N         0.0200      0     0     0 0.0102     0     0     0     0
## 2 25572476 N         0           0     0     0 0.0623     0     0     0     0
## 3 25590835 Y         0           0     0     0 0          0     0     0     0
## 4 25898986 N         0           0     0     0 0.0340     0     0     0     0
## 5 25830422 Y         0           0     0     0 0          0     0     0     0
## 6 26390591 N         0           0     0     0 0          0     0     0     0
## # … with 3,968 more variables: solid <dbl>, load <dbl>, signific <dbl>,
## #   year <dbl>, hpv <dbl>, patient <dbl>, mir <dbl>, plan <dbl>,
## #   metabolit <dbl>, stat <dbl>, stent <dbl>, coupl <dbl>, valu <dbl>,
## #   case <dbl>, biopsi <dbl>, group <dbl>, uptak <dbl>, adenoma <dbl>,
## #   experi <dbl>, type <dbl>, weight <dbl>, recommend <dbl>, ros <dbl>,
## #   biomark <dbl>, contour <dbl>, polyp <dbl>, screen <dbl>, tissu <dbl>,
## #   diseas <dbl>, stage <dbl>, mirna <dbl>, intervent <dbl>, fdg <dbl>, …
To avoid problems with the headers, we rename them before feeding the features into the machine learning algorithm, keeping only the PMID and diagnosis column names unchanged.
# 5. Tidy up columns
Tidy_up <- Tidy_Med
names(Tidy_up)[3:ncol(Tidy_up)] <- as.character(seq(1, ncol(Tidy_up) - 2, by = 1))
colnames(Tidy_up) <- make.names(colnames(Tidy_up), unique = TRUE)
Tidy_up$diagnosis <- as.factor(Tidy_up$diagnosis)
Let’s start to set up our learner:
library(mlr3)
library(mlr3learners)
# set up the classifier task using the right data and target
task = TaskClassif$new(id="PMID", backend=Tidy_up, target="diagnosis")
# define the algorithm to be used
learner = lrn("classif.rpart")
In task we define a new task, here a classification task. If your target were a number, you would use TaskRegr instead.
Then you set up your learner, choosing the type of algorithm you would like to use.
Here we use a classification tree (rpart), but many more are available.
Let’s train our learner on part of the data:
# 7. Train the model
# train a model of this learner on a subset of the task
learner$train(task, row_ids = 1:300)
# this is what the decision tree looks like
learner$model
## n= 300
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##   1) root 300 142 Y (0.4733333 0.5266667)
##     2) X15< 0.01194675 207 84 N (0.5942029 0.4057971)
##       4) X36< 0.04466631 195 72 N (0.6307692 0.3692308)
##         8) X1697< 0.00562791 185 62 N (0.6648649 0.3351351)
##          16) X1097< 0.01387616 168 49 N (0.7083333 0.2916667)
##            32) X220< 0.0176174 160 42 N (0.7375000 0.2625000)
##              64) X1789< 0.007216191 150 34 N (0.7733333 0.2266667)
##               128) X970< 0.01890428 143 28 N (0.8041958 0.1958042) *
##               129) X970>=0.01890428 7 1 Y (0.1428571 0.8571429) *
##              65) X1789>=0.007216191 10 2 Y (0.2000000 0.8000000) *
##            33) X220>=0.0176174 8 1 Y (0.1250000 0.8750000) *
##          17) X1097>=0.01387616 17 4 Y (0.2352941 0.7647059) *
##         9) X1697>=0.00562791 10 0 Y (0.0000000 1.0000000) *
##       5) X36>=0.04466631 12 0 Y (0.0000000 1.0000000) *
##     3) X15>=0.01194675 93 19 Y (0.2043011 0.7956989)
##       6) X1125>=0.006552069 22 11 N (0.5000000 0.5000000)
##        12) X1132< 0.004224513 9 1 N (0.8888889 0.1111111) *
##        13) X1132>=0.004224513 13 3 Y (0.2307692 0.7692308) *
##       7) X1125< 0.006552069 71 8 Y (0.1126761 0.8873239) *
Here we trained on only the first 300 rows; in general you would use 60-80% of the data to train the algorithm.
Let’s use this model to predict the diagnosis label of our test data:
# 8. Use the test set to see how the model classifies the test data
predictions = learner$predict(task, row_ids = 301:350)
# look at the first 10 rows of the predictions
head(as.data.table(predictions), n = 10)
##     row_ids truth response
##  1:     301     Y        N
##  2:     302     N        N
##  3:     303     Y        N
##  4:     304     N        N
##  5:     305     Y        N
##  6:     306     Y        Y
##  7:     307     N        N
##  8:     308     Y        N
##  9:     309     Y        N
## 10:     310     Y        Y
How good is our model?
# 9. accuracy of our model on the test set
predictions$score(msr("classif.acc"))
## classif.acc
##        0.64
# 10. look at the confusion matrix
predictions$confusion
##         truth
## response  N  Y
##        N 14 15
##        Y  3 18
While the accuracy gives an indication of how good your model is, there are situations in which it is not enough; the false positives (type I errors) and false negatives (type II errors) are a useful complement to it.
If you have an unbalanced dataset (90% of diagnostic publications), your classifier can guess diagnostic for all the papers, and it would be 90% accurate.
In some cases you would rather have a model which minimises false negatives, even at the cost of more false positives. If you run covid tests to decide who should self-isolate, you would rather make the error of diagnosing someone with covid who is not really sick and asking them to stay at home. There may thus be cases in which you accept a less accurate model in order to minimise one type of error.
For more information on how to derive further indicators from the confusion matrix, look at measures such as sensitivity, specificity, precision and recall (see Han, Kamber, and Pei (2012), pp. 366-370).
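Taking Y as the positive class, these measures can be computed directly from the confusion matrix above:

```r
# counts read off the confusion matrix above (positive class = Y)
TP <- 18   # predicted Y, truly Y
FP <- 3    # predicted Y, truly N
FN <- 15   # predicted N, truly Y
TN <- 14   # predicted N, truly N

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.64, matching classif.acc
precision   <- TP / (TP + FP)   # of the papers predicted Y, how many truly are
recall      <- TP / (TP + FN)   # sensitivity: share of true Y papers we found
specificity <- TN / (TN + FP)   # share of true N papers correctly rejected

round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 3)
```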
Let’s try to improve upon the model I have just started. Let’s start working with the first exercise.
Unsupervised learning is used for more exploratory purposes: looking at features in the data and grouping observations into classes.
There is no given target; the algorithm finds similarity between observations or uncovers underlying relevant features. The analyst may need to provide some parameters, such as the number of clusters to be found, and usually needs to identify/label the clusters after the algorithm has done the grouping.
You have already seen some examples of unsupervised learning in previous classes, such as topic models, or co-word analysis.
Topic models build on the assumption that there are multiple topics within each document, and on assumptions about the statistical distribution of word frequencies within and between documents.
There are several algorithms to build topic models; the most popular are LDA (Latent Dirichlet Allocation), LSA or LSI (Latent Semantic Analysis/Indexing), and NMF (Non-negative Matrix Factorisation). They normally take normalised and vectorised text documents as input (e.g. in tf-idf form), though you still have to supply parameters such as the number of topics to be computed from the collection of documents in use.
Word embedding models have become increasingly popular for text analysis. Their main use is to capture the relationships between words in a corpus: after training the model, you can explore which words are most similar to a given word (i.e. near-synonyms), find antonyms, and much more.
Word embeddings became popular after Mikolov et al. (2013) released their implementation, Word2Vec. Together with GloVe, developed by Pennington, Socher, and Manning (2014), these are the two most popular word embedding algorithms in use today.
One of the main novelties of these algorithms is that they use the words surrounding each token as labels to train the model. It is therefore ambiguous whether they are unsupervised or supervised models.
Using text data from SPRU publications (the SPRU history project) gathered from the Web of Science, Scopus and SRO, I have pre-processed the corpus: the title and abstract of each observation were combined into one field, lower-cased, stripped of special characters and punctuation, and merged into a single dataset. Let's load the dataset and train a GloVe model on it using the text2vec package.
rm(list=ls())
library(text2vec)
library(tidyverse)
spru <- read_csv("SPRU_publi.csv")
head(spru)
## # A tibble: 6 × 1
##   text
##   <chr>
## 1 episodic innovation r d strategies for project based environments many busine…
## 2 innovation in low tech industries na
## 3 exploring the capital goods economy complex product systems in the uk over th…
## 4 governance and co ordination of distributed innovation processes patterns of …
## 5 what do we know about innovation in this editorial article we introduce this …
## 6 world sustainable development outlook west meet east sharing the past and cur…
First, we need to transform the text into tokens. Then, we are going to build a vocabulary from all the tokens that have been identified in the corpus. We will also remove the rarer terms, as these won’t have enough training data to be useful for the model.
# Create iterator over tokens
tokens <- space_tokenizer(spru$text)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
# Remove tokens used less than 5 times
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
In the next step, we first vectorise the vocabulary, which is then used to construct the term co-occurrence matrix (tcm) using a skip-gram window: each word is represented using the five words before and the five words after it.
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
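To see what the skip-gram window is counting, here is a toy co-occurrence count in base R using a window of 2 on a five-word made-up sentence (create_tcm does the same at scale, with a window of 5):

```r
tokens <- c("innovation", "policy", "research", "innovation", "studies")
window <- 2

# collect (word, context-word) pairs within the window around each position
pairs <- list()
for (i in seq_along(tokens)) {
  ctx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
  for (j in ctx) pairs[[length(pairs) + 1]] <- c(tokens[i], tokens[j])
}

# tabulate the pairs into a small term co-occurrence matrix
tcm_toy <- table(sapply(pairs, `[`, 1), sapply(pairs, `[`, 2))
tcm_toy["innovation", "research"]   # 2: both occurrences fall within 2 of "research"
```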
Let’s create a model based on 50 features and use 10 iterations to improve the model. Note that the model is based on two matrices, the main matrix and the context matrix.
glove = GlobalVectors$new(rank = 50, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 10)
## INFO [19:35:12.382] epoch 1, loss 0.2569
## INFO [19:35:13.576] epoch 2, loss 0.1713
## INFO [19:35:15.471] epoch 3, loss 0.1447
## INFO [19:35:16.713] epoch 4, loss 0.1290
## INFO [19:35:17.687] epoch 5, loss 0.1179
## INFO [19:35:18.643] epoch 6, loss 0.1096
## INFO [19:35:19.505] epoch 7, loss 0.1030
## INFO [19:35:20.310] epoch 8, loss 0.0977
## INFO [19:35:21.246] epoch 9, loss 0.0934
## INFO [19:35:22.219] epoch 10, loss 0.0897
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
rank: number of features in the model
x_max: maximum number of co-occurrences to use in the matrix
n_iter: number of iterations to optimise the model
Let’s try our model out: which words are the most similar to a given word? Let’s take the word firm:
firm = word_vectors["firm", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = firm, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
##        firm      growth performance    specific       level
##   1.0000000   0.6151884   0.6128337   0.6075509   0.5843449
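Behind sim2, cosine similarity is just the normalised dot product of two vectors. A base-R sketch, using made-up 3-dimensional vectors (the real GloVe vectors here have 50 dimensions):

```r
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# toy "word vectors", invented for illustration only
firm_v   <- c(0.2, 0.7, 0.1)
growth_v <- c(0.3, 0.6, 0.2)
policy_v <- c(-0.5, 0.1, 0.9)

cosine(firm_v, firm_v)     # a word is always most similar to itself: exactly 1
cosine(firm_v, growth_v)   # similar direction, similarity close to 1
cosine(firm_v, policy_v)   # different direction, much lower similarity
```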
Clustering techniques are more traditional unsupervised learning techniques (applicable to any type of problem), a little less used for text mining than the methods above.
Clustering is the process of partitioning a set of observations into subsets (i.e. classes or clusters), such that within each cluster the observations are similar to one another, while dissimilar to observations in other clusters (Han, Kamber, and Pei 2012, 444).
To do clustering analysis, you need to evaluate how similar the observations in your data are, and then apply an algorithm (i.e. a learner) which groups them into clusters; different learners may produce different clusterings. To prepare data for the learner, you may want to compute similarity measures between observations (e.g. cosine similarity).
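A minimal clustering sketch with k-means (built into base R) on made-up 2-D data; note that the analyst supplies the number of clusters and labels them afterwards:

```r
set.seed(123)
# two obvious groups of 25 points each in two dimensions
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2)   # the analyst chooses the number of clusters

table(km$cluster)   # how many observations fell into each cluster
km$centers          # inspecting the centres helps the analyst label the clusters
```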
There are many other clustering methods to identify clusters, using statistical approaches, grid-based methods, and many more algorithms.
The above just gives you an overview of what is feasible.
The model that you develop is only as good as your training data. Training data can over-represent or under-represent certain categories or characteristics, so the resulting model, once implemented, may not work well for the under-represented examples.
One well-known example was the automatic labelling of objects in pictures. You may also include sensitive data, or data about protected characteristics of individuals.
Even if you take care not to include these, other characteristics may be highly correlated with protected characteristics, and the model can become discriminatory.
Machine learning may also be rolled out on a variety of uses over time; the data used to build a model may be fit for purpose at the time of development, but may not be representative of the new cases that you encounter and want to generalise to.
Using machine learning to make decisions also raises questions of accountability and transparency: how was the decision reached, and what in particular led to the decision or outcome?
Some popular machine learning algorithms, such as neural networks / deep learning, do not have a transparent model, and it may hence be difficult to understand why the system recommended a specific decision.
Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM 55 (4): 77–84. https://doi.org/10.1145/2133806.2133826.
Ciarli, Tommaso, and Ismael Ràfols. 2019. “The Relation Between Research Priorities and Societal Demands: The Case of Rice.” Research Policy 48 (4): 949–67. https://doi.org/10.1016/j.respol.2018.10.027.
Domingos, Pedro. 2015. The Master Algorithm. Basic Books.
Géron, Aurélien. 2019. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.
Han, Jiawei, Micheline Kamber, and Jian Pei. 2012. Data Mining: Concepts and Techniques. Elsevier/Morgan Kaufmann.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Lang, Frédérique, and Jane Pujols. 2016. “A SPRU History Based on Bibliometric Analysis of the Studies of SPRU PhD Students.” Brighton: SPRU, Business School, University of Sussex.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” In 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of EMNLP 2014, 1532–43. http://nlp.
Rhys, Hefin Ioan. 2020. Machine Learning with R, the Tidyverse, and mlr. Manning Publications.